14 research outputs found

    A 2-means Clustering Technique for Unsupervised Spam Filtering

    Unsolicited commercial e-mail, or "spam", wastes network bandwidth and human effort in Internet and mobile-phone communication, and distinguishing legitimate from spam e-mails is a hard problem. The majority of proposed algorithms use supervised learning techniques. Unfortunately, these approaches have the drawback of requiring training over large, manually and costly tagged e-mail corpora. In this paper, we present an unsupervised method that filters spam e-mails without the need for training over such corpora. Using a 2-means clustering technique, we perform a 2-way classification. To overcome the serious complications imposed by the high dimensionality of the data, the algorithm first transforms the data into a low-dimensional component space by applying Principal Component Analysis, and then performs clustering in that space. The method proved promising when evaluated over the publicly available "SpamAssassin" corpus, provided by the open-source project of the same name for evaluation purposes. The achieved performance is comparable to that of systems based on supervised learning techniques.
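    The PCA-then-cluster pipeline described above can be sketched compactly. A minimal sketch assuming scikit-learn; the TF-IDF representation and the number of principal components are illustrative choices, not details taken from the paper:

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    def cluster_emails(emails, n_components=50):
        """2-way unsupervised spam partition: PCA projection, then 2-means."""
        # High-dimensional bag-of-words representation (an assumed choice;
        # the abstract does not fix the exact feature extraction).
        X = TfidfVectorizer(stop_words="english").fit_transform(emails)
        # Project into a low-dimensional component space with PCA
        # (n_components=50 is illustrative and must not exceed the corpus size).
        X_low = PCA(n_components=n_components).fit_transform(X.toarray())
        # 2-means clustering yields the 2-way spam / legitimate partition;
        # deciding which cluster is "spam" is a separate, later step.
        return KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_low)
    ```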

    Exploiting Class Label Frequencies for Text Classification

    Document classification is an example of Machine Learning (ML) applied to Natural Language Processing (NLP). By classifying text, we aim to assign one or more classes or categories to a document, making it easier to manage and sort. In the vast majority of document classification techniques, a document is represented as a bag of words consisting of all the individual terms making up the document, together with the number of times each term appears in it. These occurrence counts are known as local term frequencies, and it is very common to make use of them at the price of some added information in the classification model. In this work, we extend our previous work on medical article classification [1,2] by simplifying the weighting scheme in the ranking process, using class label frequencies to devise a simple weighting formula inspired by the traditional information retrieval task. We also evaluate the proposed approach on additional experimental data. The method we propose here, called CLF KNN, first uses a lexical approach to identify term frequencies in the document texts, and then couples this information with class label information from the corpus to devise a weighted ranking scheme for the classification decision. Evaluation experiments on two collections, the Ohsumed collection of medical documents and the 20 Newsgroups message collection, show that the proposed method significantly outperforms traditional KNN classification.
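    The exact CLF KNN weighting formula is given in the cited papers [1,2]; the sketch below only illustrates the general idea of coupling local term frequencies with class label frequencies in the TF-IDF spirit. The log(|C|/cf) weight is an assumption for illustration, not the paper's formula:

    ```python
    import math
    from collections import defaultdict

    def class_label_frequencies(docs, labels):
        """cf[t] = number of distinct classes whose documents contain term t."""
        classes_of = defaultdict(set)
        for tokens, label in zip(docs, labels):
            for t in set(tokens):
                classes_of[t].add(label)
        return {t: len(cs) for t, cs in classes_of.items()}

    def clf_weight(term, doc_tokens, cf, n_classes):
        tf = doc_tokens.count(term)  # local term frequency in this document
        # Terms concentrated in few classes get heavier weight (assumed form,
        # analogous to inverse document frequency in information retrieval).
        icf = math.log(n_classes / cf.get(term, n_classes))
        return tf * icf
    ```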

    A Weighted Maximum Entropy Language Model for Text Classification

    Abstract. The Maximum Entropy (ME) approach has been extensively used for various natural language processing tasks, such as language modeling, part-of-speech tagging, text segmentation and text classification. Previous work in text classification has used maximum entropy modeling with binary-valued features or counts of feature words. In this work, we present a method that applies Maximum Entropy modeling to text classification in a different way than it has been used so far, using weights both to select the features of the model and to emphasize the importance of each of them in the classification task. We use the chi-square (χ²) test to assess the contribution of each candidate feature: the features are ranked by their chi-square values, and the most prevalent of them, those with the highest scores, are used as the selected features of the model. Instead of using Maximum Entropy modeling in the classical way, we use the chi-square values to weight the features of the model, thus giving a different importance to each of them. The method has been evaluated on the Reuters-21578 dataset for text classification tasks, giving very promising results and performing comparably to some of the "state of the art" systems in the classification field.
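    As a rough illustration of the two roles the chi-square scores play (feature selection and feature weighting), the sketch below uses scikit-learn, with multinomial logistic regression standing in for the Maximum Entropy classifier (the two are equivalent formulations); the feature count k and other hyperparameters are assumptions:

    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.linear_model import LogisticRegression

    def train_weighted_maxent(texts, y, k=1000):
        X = CountVectorizer().fit_transform(texts)
        # Role 1: rank candidate features by chi-square score and keep the top k.
        selector = SelectKBest(chi2, k=k).fit(X, y)
        X_sel = selector.transform(X)
        # Role 2: scale each selected feature by its chi-square score, so more
        # discriminative features carry more weight in the ME model.
        weights = selector.scores_[selector.get_support()]
        X_weighted = X_sel.multiply(weights)  # sparse column-wise scaling
        clf = LogisticRegression(max_iter=1000).fit(X_weighted, y)
        return clf, selector, weights
    ```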

    Extracting Collocations in Modern Greek Language

    Abstract. In this paper we describe and apply two statistical methods for extracting collocations from text corpora written in Modern Greek. The first is the mean-and-variance method, which calculates "offsets" (distances) between words in a corpus and looks for patterns of distances with low spread. The second method is based on the chi-square (χ²) test. Such an approach seems to be more flexible because it does not assume normally distributed probabilities of the words in the corpus. The two techniques produce interesting collocations that are useful in various applications, e.g. computational lexicography, language generation and machine translation.
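    The mean-and-variance method can be sketched as follows: collect the signed offsets of co-occurring word pairs within a window, then keep the pairs whose offset distribution has low spread. The window size and the thresholds below are assumptions, not values from the paper:

    ```python
    import statistics
    from collections import defaultdict

    def offset_stats(tokens, window=5):
        """Signed offsets (j - i) of every ordered word pair within the window."""
        offsets = defaultdict(list)
        for i, w1 in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    offsets[(w1, tokens[j])].append(j - i)
        return offsets

    def low_spread_pairs(offsets, min_count=10, max_std=1.0):
        """Candidate collocations: frequent pairs with a tight offset pattern."""
        return [(pair, statistics.mean(ds), statistics.pstdev(ds))
                for pair, ds in offsets.items()
                if len(ds) >= min_count and statistics.pstdev(ds) <= max_std]
    ```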

    Using WordNet Lexical Database and Internet to Disambiguate Word Senses

    Abstract. The term "knowledge acquisition bottleneck" has been used in Word Sense Disambiguation Tasks (WSDTs) to express the problem of the lack of large tagged corpora. In this paper, an automated WSDT is based on text corpora collected from Internet web pages. First, the disambiguation of the sense of a word in a context uses the word's definition and the definitions of its direct hyponyms in WordNet to form queries for searching the Internet. Then, the "sense-related examples", that is, the collected answers, are used to disambiguate the word's sense in the context. A similarity metric calculates the similarity between the context and each "sense-related example", and the word is assigned the sense of the example most similar to the context. Some experiments are briefly described and the evaluation of the proposed method is discussed.
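    A hedged sketch of the two steps, assuming NLTK's WordNet interface, with TF-IDF cosine similarity as a stand-in for the paper's similarity metric; the examples_per_sense input is assumed to come from the web-search stage the abstract describes:

    ```python
    from nltk.corpus import wordnet as wn  # requires the NLTK WordNet data
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def sense_queries(word):
        """Per sense: its gloss plus the glosses of its direct hyponyms,
        to be used as Internet search queries."""
        return {s: [s.definition()] + [h.definition() for h in s.hyponyms()]
                for s in wn.synsets(word)}

    def disambiguate(context, examples_per_sense):
        # examples_per_sense: {synset: [example text, ...]}, gathered by
        # searching the Internet with the queries above.
        texts, senses = [context], []
        for synset, examples in examples_per_sense.items():
            texts.extend(examples)
            senses.extend([synset] * len(examples))
        X = TfidfVectorizer().fit_transform(texts)
        sims = cosine_similarity(X[0], X[1:]).ravel()
        return senses[sims.argmax()]  # sense of the most similar example
    ```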
